source("utils.R")
theme_set(theme_light())
# taken from https://pachterlab.github.io/voyager/articles/visium_10x.html
#spe_vis <- readRDS("../data/spe_spot.rds")
#spe_vis
sfe_full <- SFEData::McKellarMuscleData(dataset = "full")
sfe_full <- mirrorImg(sfe_full, sample_id = "Vis5A", image_id = "lowres")
sfe <- sfe_full[,colData(sfe_full)$in_tissue]
sfe <- sfe[rowSums(counts(sfe)) > 0,]
#perform normalisation
sfe <- scater::logNormCounts(sfe)
# construct the weight matrix using the Voyager function
colGraph(sfe, "visium") <- findVisiumGraph(sfe)

From this dataset by McKellar et al., we choose two genes to analyse henceforth: Mdk (ENSMUSG00000027239) and Ncl (ENSMUSG00000026234) (McKellar et al. 2021).
Here we set the arguments for the examples below.
Spot-based data is collected along a regularly spaced grid where all sample areas have the same size. Such a grid is also called a regular lattice. In more rigorous terms, the data \(Y\) is the product of a random process, but the sampling locations are fixed along a lattice \(D\). The lattice \(D\) does not have to be regular, but for spot-based data it is. The main difference from point patterns is that the locations of the data are not the result of a stochastic process but of a defined sampling strategy (Zuur, Ieno, and Smith 2007).
The lattice is composed of individual spatial units
\[D = \{A_1, A_2,...,A_n\}\]
where these units are not supposed to overlap
\[A_i \cap A_j = \emptyset \forall i \neq j\]
The data is then a random variable of the spatial unit along the lattice
\[Y_i = Y(A_i)\]
Most lattice data analysis techniques build on the concept of neighbours. Therefore, the spatial relationship has to be modelled, e.g. with a spatial weight matrix \(W\). There are many ways to define a spatial weight matrix \(W\). Here, units that are adjacent are specified with a one and units that are not adjacent with a zero (binary contiguity matrix)
\[w_{ij} = \begin{cases} 1 & \text{if } A_i \text{ and } A_j \text{ are adjacent}\\ 0 & \text{otherwise} \end{cases}\]
Other options to specify the weight matrix \(W\) are mentioned in Zuur, Ieno, and Smith (2007).
Voyager has a dedicated function for constructing the weight matrix for Visium data: findVisiumGraph.
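To make the binary contiguity matrix concrete, here is a minimal sketch in base R on a toy 3 x 3 square grid. It assumes rook contiguity (units that share an edge are neighbours), which is an illustrative choice and not the hexagonal Visium layout that findVisiumGraph handles. The names coords, d and W are ours.

```r
# Toy 3 x 3 lattice: one row per spatial unit A_i
coords <- expand.grid(row = 1:3, col = 1:3)

# A Manhattan distance of 1 means two units share an edge (rook contiguity)
d <- as.matrix(dist(coords, method = "manhattan"))

# Binary contiguity matrix: w_ij = 1 if adjacent, 0 otherwise
W <- unname((d == 1) * 1)

# Number of neighbours per unit: corners have 2, edge units 3, the centre 4
rowSums(W)
```

The matrix is symmetric with a zero diagonal, since adjacency is mutual and a unit is not its own neighbour.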
Global methods give us an overview of the entire field of view and summarize the spatial autocorrelation in a single value. The metrics are a function of the weight matrix and the variables of interest. The variables of interest can be gene expression, the intensity of a marker or the area of the cell. The global measures can be seen as a weighted average of the local metrics, as explained below.
In general, a global spatial autocorrelation measure has the form of a double sum over all locations \(i,j\)
\[\sum_i \sum_j f(x_i,x_j) w_{ij}\]
where \(f(x_i,x_j)\) is the measure of association between features of interest and \(w_{ij}\) scales the relationship by a spatial weight as defined in the weight matrix \(W\). If \(i\) and \(j\) are not neighbours, i.e. we assume they do not have any spatial association, the corresponding element of the weights matrix is 0 (i.e., \(w_{ij} = 0\)). In the following we will see that the function \(f\) varies between the different spatial autocorrelation measures (Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023).
The global Moran’s I (Moran 1950) coefficient is a measure of spatial autocorrelation, defined as:
\[I = \frac{n}{\sum_i\sum_j w_{ij}} \frac{\sum_i\sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}.\]
where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\bar{x}\) is the mean of all \(x\) and \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\). The expected value is close to \(0\) for large \(n\) (\(\mathbb{E}(I) = -1/(n-1)\)). A value higher than the expected value indicates positive spatial auto-correlation, whereas a lower value indicates negative auto-correlation.
DataFrame with 2 rows and 2 columns
moran K
<numeric> <numeric>
ENSMUSG00000027239 0.025309 77.87974
ENSMUSG00000026234 0.115698 2.50434
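The formula for the global Moran’s I can be checked by hand on a toy lattice. The following base R sketch assumes a 4 x 4 square grid with rook contiguity and two made-up variables (a smooth gradient and a checkerboard). It is not the Voyager computation that produced the output above, and the helper moran_I is our own name.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Global Moran's I, term by term as in the formula
moran_I <- function(x, W) {
  n <- length(x)
  z <- x - mean(x)                        # deviations from the mean
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

x_smooth  <- coords$row + coords$col         # neighbours have similar values
x_checker <- (coords$row + coords$col) %% 2  # neighbours always differ

I_smooth  <- moran_I(x_smooth, W)   # positive: similar values are adjacent
I_checker <- moran_I(x_checker, W)  # -1: perfectly alternating pattern
```

The checkerboard gives the most negative value possible on this grid because every pair of neighbours lies on opposite sides of the mean.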
We can also use the moran.mc function to calculate the Moran’s I coefficient. This function uses a Monte Carlo simulation to calculate the p-value.
DataFrame with 2 rows and 12 columns
Ensembl symbol type means
<character> <character> <character> <numeric>
ENSMUSG00000027239 ENSMUSG00000027239 Mdk Gene Expression 0.00500801
ENSMUSG00000026234 ENSMUSG00000026234 Ncl Gene Expression 0.22095353
vars cv2 moran.mc_statistic_Vis5A
<numeric> <numeric> <numeric>
ENSMUSG00000027239 0.00698754 278.6078 0.025309
ENSMUSG00000026234 0.62578406 12.8181 0.115698
moran.mc_parameter_Vis5A moran.mc_p.value_Vis5A
<numeric> <numeric>
ENSMUSG00000027239 177 0.11940299
ENSMUSG00000026234 201 0.00497512
moran.mc_alternative_Vis5A moran.mc_method_Vis5A
<character> <character>
ENSMUSG00000027239 greater Monte-Carlo simulati..
ENSMUSG00000026234 greater Monte-Carlo simulati..
moran.mc_res_Vis5A
<list>
ENSMUSG00000027239 0.0049342,0.0144482,0.0103339,...
ENSMUSG00000026234 0.00985937,-0.01687232,-0.02470626,...
[1] 0.02530897 0.11569815
[1] 0.119402985 0.004975124
Both genes have a positive Moran’s I coefficient, but only Ncl has a significant p-value (about 0.005). For Mdk the p-value is about 0.12, so there is no strong evidence of spatial autocorrelation. The expected value is \(\mathbb{E}(I) = -1/(n-1)\), which is close to 0 for large \(n\). Positive and significant values indicate that areas with similar values are clustered. It is important to note that this can occur at either the high or the low end of the values of interest. Negative values indicate that neighbouring areas tend to have alternating values, i.e., they give a measure of spatial heterogeneity. Moreover, the result depends on the weight matrix: different weight matrices will give different results. To compare Moran’s I coefficients between different variables, we need to use the same weight matrix.
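The Monte Carlo idea behind such a p-value can be sketched in base R: shuffle the values over the locations many times, recompute Moran’s I each time, and rank the observed value among the permuted ones. The toy data, the number of permutations (nsim) and the helper moran_I are assumptions for illustration; this is not the spdep implementation.

```r
set.seed(1)

# Toy 6 x 6 lattice with rook contiguity
coords <- expand.grid(row = 1:6, col = 1:6)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

moran_I <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Made-up variable: a gradient along the rows plus noise
x <- coords$row + rnorm(nrow(coords), sd = 0.5)

nsim   <- 199
I_obs  <- moran_I(x, W)
# Null distribution: permuting x breaks any link between value and location
I_perm <- replicate(nsim, moran_I(sample(x), W))

# One-sided p-value (alternative "greater"), counting the observed value itself
p_value <- (sum(I_perm >= I_obs) + 1) / (nsim + 1)
```

With this rank-based definition the smallest attainable p-value is \(1/(\text{nsim}+1)\). The value 0.00497512 reported for Ncl equals \(1/201\), which would be consistent with 200 simulations.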
Geary’s \(C\) (Geary 1954) is a different measure of global autocorrelation and is very closely related to Moran’s \(I\). However, it focuses on spatial dissimilarity. Geary’s \(C\) is defined by
\[C = \frac{(n-1) \sum_i \sum_j w_{ij}(x_i-x_j)^2}{2\sum_i \sum_j w_{ij}\sum_i(x_i-\bar{x})^2}\]
where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\bar{x}\) is the mean of all \(x\), \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\) and \(n\) is the total number of locations. The interpretation is opposite to Moran’s \(I\): a value smaller than \(1\) indicates positive auto-correlation whereas a value greater than \(1\) represents negative auto-correlation.
DataFrame with 2 rows and 18 columns
Ensembl symbol type means
<character> <character> <character> <numeric>
ENSMUSG00000027239 ENSMUSG00000027239 Mdk Gene Expression 0.00500801
ENSMUSG00000026234 ENSMUSG00000026234 Ncl Gene Expression 0.22095353
vars cv2 moran.mc_statistic_Vis5A
<numeric> <numeric> <numeric>
ENSMUSG00000027239 0.00698754 278.6078 0.025309
ENSMUSG00000026234 0.62578406 12.8181 0.115698
moran.mc_parameter_Vis5A moran.mc_p.value_Vis5A
<numeric> <numeric>
ENSMUSG00000027239 177 0.11940299
ENSMUSG00000026234 201 0.00497512
moran.mc_alternative_Vis5A moran.mc_method_Vis5A
<character> <character>
ENSMUSG00000027239 greater Monte-Carlo simulati..
ENSMUSG00000026234 greater Monte-Carlo simulati..
moran.mc_res_Vis5A
<list>
ENSMUSG00000027239 0.0049342,0.0144482,0.0103339,...
ENSMUSG00000026234 0.00985937,-0.01687232,-0.02470626,...
geary.mc_statistic_Vis5A geary.mc_parameter_Vis5A
<numeric> <numeric>
ENSMUSG00000027239 0.960884 21
ENSMUSG00000026234 0.881768 1
geary.mc_p.value_Vis5A geary.mc_alternative_Vis5A
<numeric> <character>
ENSMUSG00000027239 0.10447761 greater
ENSMUSG00000026234 0.00497512 greater
geary.mc_method_Vis5A geary.mc_res_Vis5A
<character> <list>
ENSMUSG00000027239 Monte-Carlo simulati.. 1.034019,0.992434,0.982648,...
ENSMUSG00000026234 Monte-Carlo simulati.. 0.99592,1.01737,1.02285,...
[1] 0.02530897 0.11569815
[1] 0.119402985 0.004975124
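Again, both values of Geary’s \(C\) are below \(1\), which points towards positive spatial auto-correlation. As with Moran’s \(I\), only the result for Ncl is significant (p-value about 0.005), while the p-value for Mdk is about 0.10.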
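As a cross-check of the definition, Geary’s \(C\) can be computed by hand in base R. The sketch below assumes a toy 4 x 4 grid with rook contiguity and two made-up variables; geary_C is our own helper name, not a package function.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Geary's C: based on squared differences between neighbouring values
geary_C <- function(x, W) {
  n   <- length(x)
  num <- (n - 1) * sum(W * outer(x, x, "-")^2)
  den <- 2 * sum(W) * sum((x - mean(x))^2)
  num / den
}

x_smooth  <- coords$row + coords$col         # smooth gradient
x_checker <- (coords$row + coords$col) %% 2  # alternating pattern

C_smooth  <- geary_C(x_smooth, W)   # below 1: positive autocorrelation
C_checker <- geary_C(x_checker, W)  # above 1: negative autocorrelation
```

In both toy variables neighbouring values differ by exactly one unit, so the numerators are identical; the two results differ only through the overall variance in the denominator.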
The global \(G\) (Getis and Ord 1992) statistic is a generalisation of the local version (see below) and summarises the contributions of all pairs of values \((x_i, x_j)\) in the dataset. Formally that is
\[G(d) = \frac{\sum_{i = 1}^n \sum_{j=1}^n w_{ij}(d)x_ix_j}{\sum_{i = 1}^n \sum_{j=1}^n x_i x_j} \quad \text{s.t. } j \neq i.\]
The global \(G(d)\) statistic is very similar to global Moran’s \(I\). The global \(G(d)\) statistic is based on the sum of the products of the datapoints whereas global Moran’s \(I\) is based on the sum of the covariances. Since these two approaches capture different aspects of a structure, their values will differ as well. A good approach would be to not use one statistic in isolation but rather consider both.
It is recommended to use binary weights for this calculation. We will use the spdep package directly to calculate the global \(G\) statistic.
Getis-Ord global G statistic
data: counts(sfe)[features[1], ]
weights: weights_neighbourhoods_binary
standard deviate = 0.4597, p-value = 0.3229
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
1.659292e-03 1.074114e-03 1.620386e-06
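The definition of \(G(d)\) can be illustrated with a few lines of base R. This is a minimal sketch assuming a toy 4 x 4 grid, binary rook-contiguity weights and two made-up non-negative variables with the same values arranged differently; global_G is our own helper and not the spdep function used above.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Global G: weighted sum of products over all pairs with j != i
global_G <- function(x, W) {
  xx <- outer(x, x)
  diag(xx) <- 0            # exclude the pairs with j == i
  sum(W * xx) / sum(xx)
}

# Four high values (10) among low values (1), arranged in two ways
x_clustered <- ifelse(coords$row <= 2 & coords$col <= 2, 10, 1)  # 2 x 2 block
x_scattered <- ifelse(coords$row %in% c(1, 4) & coords$col %in% c(1, 4), 10, 1)  # corners

G_clustered <- global_G(x_clustered, W)
G_scattered <- global_G(x_scattered, W)

# Reference value under no spatial association, assuming E(G) = sum(W) / (n (n - 1))
n <- nrow(coords)
G_expected <- sum(W) / (n * (n - 1))
```

Clustering the high values next to each other pushes \(G\) above the reference value, whereas scattering them pulls it below.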
Unlike global measures that give an overview over the entire field of view, local measures report information about the statistic at each location (cell). There exist local analogs of Moran’s I and Geary’s C for which the global statistic can be represented as a weighted sum of the local statistics. As above, the local coefficients are based on both the spatial weights matrix and the values of the measurement of interest.
The local Moran’s I coefficient (Anselin 1995) is a measure of spatial autocorrelation on each location of interest. It is defined as:
\[I_i = \frac{x_i - \bar{x}}{\sum_{k=1}^n(x_k-\bar{x})^2/(n-1)} \sum_{j=1}^n w_{ij}(x_j - \bar{x})\]
where the index \(i\) refers to the location for which the measure is calculated. The interpretation is analogous to the global Moran’s I: a value of \(I_i\) higher than \(\mathbb{E}(I) = -1/(n-1)\) indicates positive spatial auto-correlation, while smaller values indicate negative auto-correlation. It is important to note that, as for the global counterpart, a high value of local Moran’s I can result from either the high or the low end of the values. Since we measure and test a large number of locations simultaneously, we need to correct for multiple testing (e.g., using the Benjamini-Hochberg procedure).
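The link between the local and the global statistic can be verified numerically. The base R sketch below assumes a toy 4 x 4 grid with rook contiguity and a made-up gradient variable. The p-values passed to p.adjust are hypothetical numbers chosen only to show the Benjamini-Hochberg adjustment.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

x <- coords$row + coords$col   # made-up variable: smooth gradient
n <- length(x)
z <- x - mean(x)

# Local Moran's I for every location i, following the formula above
I_local <- z / (sum(z^2) / (n - 1)) * as.vector(W %*% z)

# Global Moran's I, and the same value recovered from the local statistics
I_global     <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
I_from_local <- n / ((n - 1) * sum(W)) * sum(I_local)

# Multiple testing: hypothetical per-location p-values, for illustration only
p_raw <- c(0.001, 0.012, 0.03, 0.04, 0.2)
p_bh  <- p.adjust(p_raw, method = "BH")
```

With the scaling used here, the global coefficient equals the sum of the local values multiplied by \(n / ((n-1)\sum_i\sum_j w_{ij})\).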
Similar to local Moran’s I, there is a local Geary’s C (Anselin 1995) coefficient. It is defined as
\[C_i = \sum_{j=1}^n w_{ij}(x_i-x_j)^2\]
The interpretation is analogous to the global Geary’s C (value less than \(1\) indicates positive auto-correlation, a value greater than \(1\) highlights negative auto-correlation).
In this example, we will not plot the local Geary’s C coefficient for gene expression but for features that are associated with an individual cell, e.g., the number of counts or the number of genes expressed. For this, the colDataUnivariate function is used to calculate the local Geary’s C coefficient for such features.
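The definition of local Geary’s C given above can be sketched in base R as follows. The toy 4 x 4 grid, the rook-contiguity weights and the single-spike variable are assumptions for illustration. Software implementations may standardise the variable first, so absolute values can differ from this raw version.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Made-up variable: zero everywhere except one spike at row 2, column 2
x <- ifelse(coords$row == 2 & coords$col == 2, 5, 0)

# Local Geary's C: weighted squared differences to the neighbours of i
C_local <- rowSums(W * outer(x, x, "-")^2)

# The spike differs strongly from all four neighbours, so it has the largest value
spike <- which.max(C_local)

# The global statistic is a rescaled sum of the local ones
n <- length(x)
C_global <- (n - 1) * sum(C_local) / (2 * sum(W) * sum((x - mean(x))^2))
```

Locations far from the spike have a value of zero because they are identical to all of their neighbours.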
The local Getis-Ord \(G_i\) (J. K. Ord and Getis 1995; Getis and Ord 1992) statistic quantifies the weighted concentration of points within a radius \(d\) and in a local region \(i\), according to:
\[G_i(d) = \frac{\sum_{j \neq i } w_{ij}(d)x_j}{\sum_{j \neq i} x_j}\]
There is a variant of this statistic, \(G_i^*(d)\), which is the same as \(G_i(d)\) except that the contribution when \(j=i\) is included in the term.
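The difference between \(G_i(d)\) and \(G_i^*(d)\) is easiest to see in code. The base R sketch below assumes a toy 4 x 4 grid, binary rook-contiguity weights and a made-up variable with a block of high values; local_G and local_G_star are our own helper names. It computes the raw ratios only, whereas software such as spdep typically reports a standardised (z-score) version.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Made-up variable: a 2 x 2 block of high values in one corner
x <- ifelse(coords$row <= 2 & coords$col <= 2, 10, 1)

# G_i: location i is excluded from numerator and denominator
local_G <- function(x, W) {
  diag(W) <- 0
  as.vector(W %*% x) / (sum(x) - x)
}

# G_i^*: the contribution of j == i is included (w_ii set to 1)
local_G_star <- function(x, W) {
  diag(W) <- 1
  as.vector(W %*% x) / sum(x)
}

Gi      <- local_G(x, W)
Gi_star <- local_G_star(x, W)
```

Unit 1 lies inside the high-value block and unit 16 in the opposite corner, so both statistics are much larger for unit 1 than for unit 16.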
The results above give an estimate of the local Getis-Ord statistic for each cell, but no significance value. Significance can be assessed with a permutation approach using the localG_perm argument.
Positive values indicate clustering of high values, i.e., hot spots, and negative values indicate clustering of low values, i.e., cold spots. The method does not detect outlier values because, unlike in local Moran’s I, there is no cross-product between \(i\) and \(j\). But unlike local Moran’s I, we know the type of interaction (high-high or low-low) between \(i\) and \(j\).
The local spatial heteroscedasticity (LOSH) is a measure of spatial autocorrelation that is based on the variance of the local neighbourhood. Unlike the other measures, this method does not assume homoscedastic variance over the whole tissue region. LOSH is defined as:
\[H_i(d) = \frac{\sum_j w_{ij}(d)|e_j(d)|^a}{\sum_j w_{ij}(d)}\]
where \(e_j(d) = x_j - \bar{x}_i(d), j \in N(i,d)\) are the local residuals, i.e., the deviations of the neighbouring values from the local mean \(\bar{x}_i(d)\). The power \(a\) modulates the interpretation of the residuals (\(a=1\): \(H_i\) measures absolute deviations from the local mean; \(a=2\): \(H_i\) measures squared deviations and thus behaves like a local variance).
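A direct transcription of this definition into base R might look as follows. The sketch assumes a toy 4 x 4 grid with binary rook-contiguity weights (location \(i\) itself is not part of its neighbourhood) and a made-up variable that is constant in the left half and irregular in the right half. The helper losh is our own and follows the formula as written here, which may differ in detail from package implementations.

```r
# Toy 4 x 4 lattice with a binary rook-contiguity weight matrix
coords <- expand.grid(row = 1:4, col = 1:4)
W <- unname((as.matrix(dist(coords, method = "manhattan")) == 1) * 1)

# Made-up variable, ordered column by column:
# columns 1-2 are constant, columns 3-4 are heterogeneous
x <- c(rep(5, 8), 0, 10, 0, 10, 10, 10, 0, 0)

# LOSH: weighted mean of |residual|^a around the local mean of i's neighbourhood
losh <- function(x, W, a = 2) {
  vapply(seq_along(x), function(i) {
    w <- W[i, ]
    local_mean <- sum(w * x) / sum(w)          # local mean for location i
    sum(w * abs(x - local_mean)^a) / sum(w)    # average |e_j|^a over neighbours
  }, numeric(1))
}

H2 <- losh(x, W, a = 2)  # variance-like version
H1 <- losh(x, W, a = 1)  # absolute-deviation version
```

Unit 1 sits in the constant region and gets a value of zero, while unit 14 (row 2, column 4) has neighbours with values 10, 0 and 10 and therefore a clearly positive value.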
LOSH should be interpreted in combination with the local Getis-Ord \(G_i^*\) statistic. \(G_i^*\) quantifies the local mean of the variable of interest, while \(H_i\) quantifies the local variance. The following table, provided by Ord and Getis (J. Keith Ord and Getis 2012), summarizes the interpretation of the combination of \(G_i^*\) and \(H_i\).
| | high \(H_i\) | low \(H_i\) |
|---|---|---|
| large \(\|G_i^*\|\) | A hot spot with heterogeneous local conditions | A hot spot with similar surrounding areas; the map would indicate whether the affected region is larger than the single “cell” |
| small \(\|G_i^*\|\) | Heterogeneous local conditions but at a low average level (an unlikely event) | Homogeneous local conditions and a low average level |
The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, thus from identifying the pattern we cannot infer the underlying process. An indication of clustering does not explain why it occurs. On the one hand, clustering can be the result of spatial interaction between the variables of interest, e.g., an accumulation of a gene of interest in one region of the tissue. On the other hand, clustering can be the result of spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., cells with uniform expression of a gene of interest are grouped together, which then creates the apparent clustering of the gene expression measurement.
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] magrittr_2.0.3 stringr_1.5.0
[3] dixon_0.0-8 splancs_2.01-44
[5] spdep_1.2-8 spData_2.3.0
[7] tmap_3.3-4 scater_1.28.0
[9] scran_1.28.2 scuttle_1.10.3
[11] SFEData_1.2.0 SpatialFeatureExperiment_1.2.3
[13] Voyager_1.2.7 rgeoda_0.0.10-4
[15] digest_0.6.33 ncf_1.3-2
[17] sf_1.0-16 reshape2_1.4.4
[19] patchwork_1.2.0 STexampleData_1.8.0
[21] ExperimentHub_2.8.1 AnnotationHub_3.8.0
[23] BiocFileCache_2.8.0 dbplyr_2.3.4
[25] RANN_2.6.1 seg_0.5-7
[27] sp_2.1-1 rlang_1.1.1
[29] ggplot2_3.5.1 dplyr_1.1.3
[31] mixR_0.2.0 spatstat_3.0-6
[33] spatstat.linnet_3.1-1 spatstat.model_3.2-6
[35] rpart_4.1.19 spatstat.explore_3.2-3
[37] nlme_3.1-162 spatstat.random_3.1-6
[39] spatstat.geom_3.2-5 spatstat.data_3.0-1
[41] SpatialExperiment_1.10.0 SingleCellExperiment_1.22.0
[43] SummarizedExperiment_1.30.2 Biobase_2.60.0
[45] GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
[47] IRanges_2.34.1 S4Vectors_0.38.2
[49] BiocGenerics_0.46.0 MatrixGenerics_1.12.3
[51] matrixStats_1.0.0
loaded via a namespace (and not attached):
[1] spatstat.sparse_3.0-2 bitops_1.0-7
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] tools_4.3.1 utf8_1.2.3
[7] R6_2.5.1 HDF5Array_1.28.1
[9] mgcv_1.9-1 rhdf5filters_1.12.1
[11] withr_2.5.1 gridExtra_2.3
[13] leaflet_2.2.0 leafem_0.2.3
[15] cli_3.6.1 labeling_0.4.3
[17] proxy_0.4-27 dbscan_1.1-11
[19] R.utils_2.12.2 dichromat_2.0-0.1
[21] scico_1.5.0 limma_3.56.2
[23] rstudioapi_0.15.0 RSQLite_2.3.1
[25] generics_0.1.3 crosstalk_1.2.0
[27] Matrix_1.5-4.1 ggbeeswarm_0.7.2
[29] fansi_1.0.5 abind_1.4-5
[31] R.methodsS3_1.8.2 terra_1.7-55
[33] lifecycle_1.0.3 yaml_2.3.7
[35] edgeR_3.42.4 rhdf5_2.44.0
[37] tmaptools_3.1-1 grid_4.3.1
[39] blob_1.2.4 promises_1.2.1
[41] dqrng_0.3.1 crayon_1.5.2
[43] lattice_0.21-8 beachmat_2.16.0
[45] KEGGREST_1.40.1 magick_2.8.0
[47] pillar_1.9.0 knitr_1.44
[49] metapod_1.7.0 rjson_0.2.21
[51] boot_1.3-28.1 codetools_0.2-19
[53] wk_0.8.0 glue_1.6.2
[55] vctrs_0.6.4 png_0.1-8
[57] gtable_0.3.4 cachem_1.0.8
[59] xfun_0.40 S4Arrays_1.0.6
[61] mime_0.12 DropletUtils_1.20.0
[63] units_0.8-4 statmod_1.5.0
[65] bluster_1.10.0 interactiveDisplayBase_1.38.0
[67] ellipsis_0.3.2 bit64_4.0.5
[69] filelock_1.0.2 irlba_2.3.5.1
[71] vipor_0.4.5 KernSmooth_2.23-21
[73] colorspace_2.1-0 DBI_1.1.3
[75] raster_3.6-26 tidyselect_1.2.0
[77] bit_4.0.5 compiler_4.3.1
[79] curl_5.1.0 BiocNeighbors_1.18.0
[81] DelayedArray_0.26.7 scales_1.3.0
[83] classInt_0.4-10 rappdirs_0.3.3
[85] goftest_1.2-3 spatstat.utils_3.0-5
[87] rmarkdown_2.25 XVector_0.40.0
[89] htmltools_0.5.6.1 pkgconfig_2.0.3
[91] base64enc_0.1-3 sparseMatrixStats_1.12.2
[93] fastmap_1.1.1 htmlwidgets_1.6.2
[95] shiny_1.7.5.1 DelayedMatrixStats_1.22.6
[97] farver_2.1.1 jsonlite_1.8.7
[99] BiocParallel_1.34.2 R.oo_1.25.0
[101] BiocSingular_1.16.0 RCurl_1.98-1.12
[103] GenomeInfoDbData_1.2.10 s2_1.1.4
[105] Rhdf5lib_1.22.1 munsell_0.5.0
[107] Rcpp_1.0.11 ggnewscale_0.4.9
[109] viridis_0.6.4 stringi_1.7.12
[111] leafsync_0.1.0 zlibbioc_1.46.0
[113] plyr_1.8.9 parallel_4.3.1
[115] ggrepel_0.9.4 deldir_1.0-9
[117] Biostrings_2.68.1 stars_0.6-4
[119] splines_4.3.1 tensor_1.5
[121] locfit_1.5-9.8 igraph_1.5.1
[123] ScaledMatrix_1.8.1 BiocVersion_3.17.1
[125] XML_3.99-0.14 evaluate_0.22
[127] BiocManager_1.30.22 httpuv_1.6.11
[129] purrr_1.0.2 polyclip_1.10-6
[131] rsvd_1.0.5 lwgeom_0.2-13
[133] xtable_1.8-4 e1071_1.7-13
[135] RSpectra_0.16-1 later_1.3.1
[137] viridisLite_0.4.2 class_7.3-22
[139] tibble_3.2.1 memoise_2.0.1
[141] beeswarm_0.4.0 AnnotationDbi_1.62.2
[143] cluster_2.1.4
©2024 The pasta authors. Content is published under Creative Commons CC-BY-4.0 License for the text and GPL-3 License for any code.